Participation Distribution in Committee Selection
Executive Summary
In the following computer experiments, we study how committee seats are distributed among participants as we vary the size of the participant pool of stake pool operators (SPOs) and the size of the committee. The pigeonhole principle helps us interpret the results: because the number of seats is finite, the seats assigned to each participant depend on stake, group size, and committee size.
The experiment is designed to:
- Sample a group of participants from the population without replacement.
- Calculate the stake weight of each participant, which is the stake normalized so that the weights in the group sum to 1.
- Assign a committee of fixed size by random selection with replacement, weighted by each participant's stake weight.
- Analyze how committee selection and its distribution change with group size.
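The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code in `participation_lib`: the stake values are a hypothetical heavy-tailed sample standing in for the SPO data, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical stand-in for the population's stake column (not the SPO data).
population_stake = rng.pareto(a=1.2, size=3000) * 1_000.0

group_size = 100
committee_size = 100

# Step 1: sample a group from the population without replacement.
group_idx = rng.choice(population_stake.size, size=group_size, replace=False)
group_stake = population_stake[group_idx]

# Step 2: normalize stake over the group so the weights sum to 1.
stake_weight = group_stake / group_stake.sum()

# Step 3: fill the committee seats by drawing group members with replacement,
# each draw weighted by stake weight.
seats = rng.choice(group_size, size=committee_size, replace=True, p=stake_weight)

# Step 4: count seats per participant and the number of distinct seat holders.
seat_counts = np.bincount(seats, minlength=group_size)
distinct_voters = int(np.count_nonzero(seat_counts))

print(f"seats assigned: {seat_counts.sum()}, distinct voters: {distinct_voters}")
```

Because seats are drawn with replacement, one participant can hold several seats, so the number of distinct seat holders is at most the committee size and usually smaller.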
We ran the experiments over a grid of group and committee sizes in steps of 100, first from 200 to 500 and then from 100 to 1200. The results are visualized as plots of committee assignments, showing how committee selection and seat counts change with group size.
The results show that group members with small stake weights may not be selected for any committee seat. Each selection of a new committee is called an epoch. Provided a participant's stake weight is nonzero, the probability of being selected at least once approaches 1 as the number of epochs grows. Over a short horizon, however, there is a significant chance that some participants are never selected. This is a natural outcome of a selection process with a discrete, finite number of seats, and it reflects the committee selection process as it currently stands.
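To put rough numbers on this, the sketch below computes the chance that a participant receives no seat at all. It assumes that every seat is an independent draw with replacement and that the group and its stake weights stay fixed across epochs; the notebook itself may resample the group, so treat this as an approximation. The function name `prob_never_selected` is illustrative, and the two weights are the median and minimum `stake_weight` of the 100-member group printed later in this notebook.

```python
import math


def prob_never_selected(stake_weight, committee_size, num_epochs=1):
    """Probability of holding zero seats across num_epochs committees.

    Assumes independent draws with replacement and a fixed weight,
    with 0 <= stake_weight < 1.
    """
    total_draws = committee_size * num_epochs
    # (1 - w) ** total_draws, computed via log1p for accuracy when w is tiny.
    return math.exp(total_draws * math.log1p(-stake_weight))


w_median = 1.785339e-04  # median stake weight in the sampled group
w_min = 2.145631e-09     # smallest stake weight in the sampled group

for label, w in [("median", w_median), ("minimum", w_min)]:
    p_one = prob_never_selected(w, committee_size=100, num_epochs=1)
    p_many = prob_never_selected(w, committee_size=100, num_epochs=1000)
    print(f"{label}: 1 epoch {p_one:.6f}, 1000 epochs {p_many:.3e}")
```

Under these assumptions the median-weight participant is very likely to miss any single committee yet almost certain to be seated at some point within 1000 epochs, while the smallest-weight participant is still unlikely to have been seated even once.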
# %%
# Load the required libraries
from participation_lib import (
    np,
    pd,
    plt,
    sns,
    load_data,
    get_stake_distribution,
    assign_commitee,
    simulate,
    std_error,
    plot_group_to_committee_index,
    plot_selection_count_vs_stake,
    plot_committee_selection_counts,
    plot_committee_selection_seat_cutoff,
    plot_participation,
)
# %%
# Load the Data: The population of registered SPOs
population = load_data("../data/pooltool-cleaned.csv")
print(population.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3056 entries, 0 to 3055
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   id             3056 non-null   object
 1   stake          3056 non-null   int64
 2   stake_percent  3056 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 71.8+ KB
None
# %%
population.describe()
| | stake | stake_percent |
|---|---|---|
| count | 3.056000e+03 | 3056.000000 |
| mean | 7.305314e+06 | 0.032723 |
| std | 1.648449e+07 | 0.073839 |
| min | 0.000000e+00 | 0.000000 |
| 25% | 5.265000e+02 | 0.000002 |
| 50% | 5.692500e+04 | 0.000255 |
| 75% | 3.282500e+06 | 0.014703 |
| max | 1.054300e+08 | 0.472250 |
# %%
# Let's now sample a group of participants from the population
# and calculate the stake weight for each participant.
group_size = 100
group_stakes = get_stake_distribution(
    population,
    group_size=group_size,
    num_iter=100,
    plot_it=True,
)
print(group_stakes)
          stake  stake_weight
0   71397500.00  8.753869e-02
1   67897900.00  8.324792e-02
2   64359600.00  7.890970e-02
3   60630500.00  7.433754e-02
4   55516400.00  6.806727e-02
..          ...           ...
95        17.54  2.150536e-08
96        10.50  1.287379e-08
97         5.98  7.331929e-09
98         3.50  4.291263e-09
99         1.75  2.145631e-09

[100 rows x 2 columns]
# %%
print(group_stakes.describe())
              stake  stake_weight
count  1.000000e+02  1.000000e+02
mean   8.156108e+06  1.000000e-02
std    1.687489e+07  2.068988e-02
min    1.750000e+00  2.145631e-09
25%    1.981992e+03  2.430072e-06
50%    1.456142e+05  1.785339e-04
75%    5.105425e+06  6.259634e-03
max    7.139750e+07  8.753869e-02
# %%
# Let's now assign a committee with as many seats as the group has members
# (committee_size=group_size), weighted by each participant's stake weight.
results = assign_commitee(
    group_stakes,
    committee_size=group_size,
    num_iter=1,
    plot_it=True,
)
# %%
# Let's now create plots of committee assignments where we vary the
# group size and committee size over {200, 300, 400, 500} and see how
# the committee selection and seat count change.
# Initialize Parameters:
# comm_sizes = [100] # vary over committee size, k
# group_sizes = [100] # vary over group size, n
comm_sizes = range(200, 501, 100) # vary over committee size, k
group_sizes = range(200, 501, 100) # vary over group size, n
num_iter = 1 # Number of iterations for Monte Carlo simulation
# Note that the number of iterations here can be interpreted as the number
# of selection rounds for the committee, which we call an epoch.
# If we have a new epoch per day, then 1000 iterations is about 3 years.
# %%
# Call the function
sim_results_df = simulate(
    population,
    comm_sizes,
    group_sizes,
    num_iter,
    plot_it=True,
)
Committee Size = 200 Group Size = 200
Group Size = 300
Group Size = 400
Group Size = 500
Committee Size = 300 Group Size = 200
Group Size = 300
Group Size = 400
Group Size = 500
Committee Size = 400 Group Size = 200
Group Size = 300
Group Size = 400
Group Size = 500
Committee Size = 500 Group Size = 200
Group Size = 300
Group Size = 400
Group Size = 500
# %%
# Extract the committee and group sizes from the column labels for plotting
col_index = sim_results_df.columns
committee_sizes = [
    int(col.split("=")[1].strip()) for col in col_index.get_level_values(0).unique()
]
group_sizes = [
    int(col.split("=")[1].strip()) for col in col_index.get_level_values(1).unique()
]
# Plot the percentage of group participants not selected for committee seats
plot_participation(sim_results_df, committee_sizes, group_sizes, num_iter)
# %%
# Plot the committee selection counts distribution
fig = plt.figure(figsize=(12, 8))
plot_data = sim_results_df.loc["Committee Seats"].loc["mean"]
# One color per (committee size, group size) pair
colors = sns.color_palette("tab20", len(plot_data.index))
for color, (c, g) in zip(colors, plot_data.index):
    y = plot_data.loc[(c, g)]
    x = y.index
    n_c = int(c.split("=")[1].strip())
    n_g = int(g.split("=")[1].strip())
    plt.bar(x, y, alpha=0.7, color=color, label=f"{n_c}, {n_g}")
plt.xlabel("Participant Index")
plt.ylabel("Committee Seat Count (average)")
plt.title("Committee Seat Count for Participants")
plt.legend(title="Committee Size, Group Size")
plt.xlim(0, 200)
plt.show()
# %%
# Distinct Voters
committee_voters = sim_results_df.loc["Distinct Voters"]
# Collect the mean and standard deviation for each (committee size, group size) pair
mean_values = committee_voters.loc["mean"]
std_dev_values = committee_voters.loc["sd"]
# Report the number of distinct group participants selected for committee seats
# (these are counts, not percentages: several values exceed 100)
print("Number of Distinct Group Participants Selected for Committee Seats:")
committee_participation = pd.concat([mean_values, std_dev_values], axis=1)
# committee_participation.columns = ["Mean", "Std Dev"]
print(committee_participation)
Number of Distinct Group Participants Selected for Committee Seats:
mean sd
Committee Size Group Size
Committee Size = 200 Group Size = 200 52.0 0.0
Group Size = 300 64.0 0.0
Group Size = 400 80.0 0.0
Group Size = 500 92.0 0.0
Committee Size = 300 Group Size = 200 57.0 0.0
Group Size = 300 75.0 0.0
Group Size = 400 84.0 0.0
Group Size = 500 99.0 0.0
Committee Size = 400 Group Size = 200 58.0 0.0
Group Size = 300 76.0 0.0
Group Size = 400 103.0 0.0
Group Size = 500 121.0 0.0
Committee Size = 500 Group Size = 200 65.0 0.0
Group Size = 300 81.0 0.0
Group Size = 400 101.0 0.0
Group Size = 500 127.0 0.0
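The "Distinct Voters" means above can be sanity-checked against a closed form. Assuming the k committee seats are independent draws with replacement, a participant with stake weight w_i receives at least one seat with probability 1 - (1 - w_i)^k, so the expected number of distinct seat holders is the sum of that quantity over the group. The sketch below uses a hypothetical heavy-tailed stake sample rather than the SPO data, and `expected_distinct_voters` is an illustrative name that is not part of `participation_lib`.

```python
import numpy as np


def expected_distinct_voters(stake_weight, committee_size):
    """Expected number of group members holding at least one seat.

    Assumes committee_size independent draws with replacement,
    each weighted by stake_weight (which must sum to 1).
    """
    w = np.asarray(stake_weight, dtype=float)
    return float(np.sum(1.0 - (1.0 - w) ** committee_size))


rng = np.random.default_rng(seed=1)

# Hypothetical heavy-tailed stake sample for a 200-member group.
stake = rng.pareto(a=1.2, size=200)
w = stake / stake.sum()
k = 200  # committee size

analytic = expected_distinct_voters(w, k)

# Monte Carlo check: draw many committees and count distinct seat holders.
draws = [np.unique(rng.choice(w.size, size=k, p=w)).size for _ in range(2000)]
simulated = float(np.mean(draws))

# Equal stake is the most even case possible for the same n and k.
uniform = expected_distinct_voters(np.full(200, 1 / 200), k)

print(f"analytic: {analytic:.2f}, simulated: {simulated:.2f}, uniform: {uniform:.2f}")
```

Even with perfectly equal stake, only about 63% of a group is expected to hold a seat when the committee and group are the same size, so skewed stake is not the only reason participants go unselected.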
# %%
# Let's now repeat the experiment over a larger grid, varying the
# group size and committee size over {100, 200, ..., 1200} and see how
# the committee selection and seat count change.
# Initialize Parameters:
# comm_sizes = [100] # vary over committee size, k
# group_sizes = [100] # vary over group size, n
comm_sizes = range(100, 1201, 100) # vary over committee size, k
group_sizes = range(100, 1201, 100) # vary over group size, n
num_iter = 100 # Number of iterations for Monte Carlo simulation
# Note that the number of iterations here can be interpreted as the number
# of selection rounds for the committee, which we call an epoch.
# If we have a new epoch per day, then 1000 iterations is about 3 years.
# %%
# Call the function
sim_results_df = simulate(
    population,
    comm_sizes,
    group_sizes,
    num_iter,
    plot_it=True,
)
Committee Size = 100 Group Size = 100
Group Size = 200
Group Size = 300
Group Size = 400
Group Size = 500
Group Size = 600
Group Size = 700
Group Size = 800
Group Size = 900
Group Size = 1000
Group Size = 1100
Group Size = 1200
Committee Size = 200 Group Size = 100
Group Size = 200
Group Size = 300
Group Size = 400
Group Size = 500
Group Size = 600
Group Size = 700
Group Size = 800
Group Size = 900
Group Size = 1000
Group Size = 1100
Group Size = 1200
Committee Size = 300 Group Size = 100
Group Size = 200
Group Size = 300
Group Size = 400
Group Size = 500
Group Size = 600
Group Size = 700
Group Size = 800
Group Size = 900
Group Size = 1000
Group Size = 1100
Group Size = 1200
Committee Size = 400 Group Size = 100
Group Size = 200
Group Size = 300
Group Size = 400
Group Size = 500
Group Size = 600
Group Size = 700
Group Size = 800
Group Size = 900
Group Size = 1000
Group Size = 1100
Group Size = 1200
Committee Size = 500 Group Size = 100
Group Size = 200
Group Size = 300
Group Size = 400
Group Size = 500
Group Size = 600
Group Size = 700
Group Size = 800
Group Size = 900
Group Size = 1000
Group Size = 1100
Group Size = 1200
Committee Size = 600 Group Size = 100
Group Size = 200
Group Size = 300
Group Size = 400
Group Size = 500
Group Size = 600
Group Size = 700
Group Size = 800
Group Size = 900
Group Size = 1000
Group Size = 1100
Group Size = 1200
Committee Size = 700 Group Size = 100
Group Size = 200
Group Size = 300
Group Size = 400
Group Size = 500
Group Size = 600
Group Size = 700
Group Size = 800
Group Size = 900
Group Size = 1000
Group Size = 1100
Group Size = 1200
Committee Size = 800 Group Size = 100
Group Size = 200
Group Size = 300
Group Size = 400
Group Size = 500
Group Size = 600
Group Size = 700
Group Size = 800
Group Size = 900
Group Size = 1000
Group Size = 1100
Group Size = 1200
Committee Size = 900 Group Size = 100
Group Size = 200
Group Size = 300
Group Size = 400
Group Size = 500
Group Size = 600
Group Size = 700
Group Size = 800
Group Size = 900
Group Size = 1000
Group Size = 1100
Group Size = 1200
Committee Size = 1000 Group Size = 100
Group Size = 200
Group Size = 300
Group Size = 400
Group Size = 500
Group Size = 600
Group Size = 700
Group Size = 800
Group Size = 900
Group Size = 1000
Group Size = 1100
Group Size = 1200
Committee Size = 1100 Group Size = 100
Group Size = 200
Group Size = 300
Group Size = 400
Group Size = 500
Group Size = 600
Group Size = 700
Group Size = 800
Group Size = 900
Group Size = 1000
Group Size = 1100
Group Size = 1200
Committee Size = 1200 Group Size = 100
Group Size = 200
Group Size = 300
Group Size = 400
Group Size = 500
Group Size = 600
Group Size = 700
Group Size = 800
Group Size = 900
Group Size = 1000
Group Size = 1100
Group Size = 1200
# %%
# Extract the committee and group sizes from the column labels for plotting
col_index = sim_results_df.columns
committee_sizes = [
    int(col.split("=")[1].strip()) for col in col_index.get_level_values(0).unique()
]
group_sizes = [
    int(col.split("=")[1].strip()) for col in col_index.get_level_values(1).unique()
]
# Plot the percentage of group participants not selected for committee seats
plot_participation(sim_results_df, committee_sizes, group_sizes, num_iter)
# %%
# Plot the committee selection counts distribution
fig = plt.figure(figsize=(12, 8))
plot_data = sim_results_df.loc["Committee Seats"].loc["mean"]
# One color per (committee size, group size) pair
colors = sns.color_palette("tab20", len(plot_data.index))
for color, (c, g) in zip(colors, plot_data.index):
    y = plot_data.loc[(c, g)]
    x = y.index
    n_c = int(c.split("=")[1].strip())
    n_g = int(g.split("=")[1].strip())
    plt.bar(x, y, alpha=0.7, color=color, label=f"{n_c}, {n_g}")
plt.xlabel("Participant Index")
plt.ylabel("Committee Seat Count (average)")
plt.title("Committee Seat Count for Participants")
plt.legend(title="Committee Size, Group Size")
plt.xlim(0, 200)
plt.show()
# %%
# Distinct Voters
committee_voters = sim_results_df.loc["Distinct Voters"]
# Collect the mean and standard deviation for each (committee size, group size) pair
mean_values = committee_voters.loc["mean"]
std_dev_values = committee_voters.loc["sd"]
# Report the number of distinct group participants selected for committee seats
# (these are counts, not percentages: several values exceed 100)
print("Number of Distinct Group Participants Selected for Committee Seats:")
committee_participation = pd.concat([mean_values, std_dev_values], axis=1)
# committee_participation.columns = ["Mean", "Std Dev"]
print(committee_participation)
Number of Distinct Group Participants Selected for Committee Seats:
mean sd
Committee Size Group Size
Committee Size = 100 Group Size = 100 25.91 2.025315
Group Size = 200 40.26 2.55194
Group Size = 300 50.18 3.191802
Group Size = 400 57.87 3.306524
Group Size = 500 62.37 3.110161
... ... ...
Committee Size = 1200 Group Size = 800 223.17 5.921241
Group Size = 900 244.46 6.25527
Group Size = 1000 261.56 5.846914
Group Size = 1100 279.83 6.926839
Group Size = 1200 298.7 7.054786
[144 rows x 2 columns]
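Several means in this table exceed 100, so they cannot be percentages. Reading the "Distinct Voters" mean as a count of distinct group members who received at least one seat (an assumption based on the row name), the share of the group left without a seat follows directly. The helper name `percent_not_selected` is illustrative, and the three inputs are copied from the printed table.

```python
def percent_not_selected(distinct_voters, group_size):
    """Percentage of the group holding no seat, given a distinct-voter count."""
    return 100.0 * (1.0 - distinct_voters / group_size)


# (committee size, group size) -> mean distinct voters, copied from the table.
examples = {
    (100, 100): 25.91,
    (100, 500): 62.37,
    (1200, 1200): 298.7,
}

for (committee_size, group_size), mean_distinct in examples.items():
    pct = percent_not_selected(mean_distinct, group_size)
    print(
        f"committee={committee_size}, group={group_size}: "
        f"{mean_distinct} distinct voters, {pct:.2f}% not selected"
    )
```

On this reading, roughly three quarters of the group holds no seat in a typical epoch when the committee and group sizes are equal, at both 100 and 1200.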
# %%
# Prepare the DataFrame for concatenation with the other simulation results.
# Note: these rows carry the Distinct Voters counts as-is; they are not
# normalized to a percentage of the group.
committee_participation = committee_participation.T
committee_participation.index = pd.MultiIndex.from_tuples(
    [("Committee Participation %", "mean"), ("Committee Participation %", "sd")]
)
# Concatenate this new row to the simulation results DataFrame
sim_results_df = pd.concat([sim_results_df, committee_participation], axis=0)
sim_results_df
| Committee Size | | Committee Size = 100 | | | | | | | | | | ... | Committee Size = 1200 | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Group Size | | Group Size = 100 | Group Size = 200 | Group Size = 300 | Group Size = 400 | Group Size = 500 | Group Size = 600 | Group Size = 700 | Group Size = 800 | Group Size = 900 | Group Size = 1000 | ... | Group Size = 300 | Group Size = 400 | Group Size = 500 | Group Size = 600 | Group Size = 700 | Group Size = 800 | Group Size = 900 | Group Size = 1000 | Group Size = 1100 | Group Size = 1200 |
| Distinct Voters | mean | 25.91 | 40.26 | 50.18 | 57.87 | 62.37 | 67.58 | 70.66 | 73.86 | 75.84 | 77.15 | ... | 106.02 | 131.7 | 156.54 | 179.34 | 201.49 | 223.17 | 244.46 | 261.56 | 279.83 | 298.7 |
| | sd | 2.025315 | 2.55194 | 3.191802 | 3.306524 | 3.110161 | 3.672002 | 3.663932 | 3.487177 | 3.801894 | 3.235352 | ... | 4.14 | 4.670118 | 5.485289 | 5.680176 | 5.356295 | 5.921241 | 6.25527 | 5.846914 | 6.926839 | 7.054786 |
| Committee Seats | mean | 0 9.41 1 9.02 2 7.95 3 7.48 4 ... | 0 4.96 1 4.66 2 4.39 3 4.6... | 0 3.52 1 3.26 2 2.93 3 2.8... | 0 2.38 1 2.25 2 2.13 3 2.2... | 0 1.99 1 1.92 2 2.09 3 1.7... | 0 1.61 1 1.70 2 1.44 3 1.7... | 0 1.52 1 1.54 2 1.34 3 1.2... | 0 1.09 1 1.12 2 1.17 3 1.1... | 0 1.02 1 1.10 2 0.94 3 1.0... | 0 1.14 1 0.77 2 0.93 3 1.0... | ... | 0 40.01 1 37.14 2 35.08 3 ... | 0 30.26 1 28.63 2 27.42 3 ... | 0 24.62 1 21.44 2 22.21 3 ... | 0 20.25 1 19.13 2 18.96 3 ... | 0 18.26 1 16.15 2 15.52 3 ... | 0 15.72 1 14.02 2 13.68 3 ... | 0 13.70 1 12.88 2 12.31 3 ... | 0 13.71 1 11.20 2 11.49 3 ... | 0 11.90 1 10.62 2 9.77 3 ... | 0 10.77 1 9.28 2 9.51 3 ... |
| Committee Participation % | mean | 25.91 | 40.26 | 50.18 | 57.87 | 62.37 | 67.58 | 70.66 | 73.86 | 75.84 | 77.15 | ... | 106.02 | 131.7 | 156.54 | 179.34 | 201.49 | 223.17 | 244.46 | 261.56 | 279.83 | 298.7 |
| | sd | 2.025315 | 2.55194 | 3.191802 | 3.306524 | 3.110161 | 3.672002 | 3.663932 | 3.487177 | 3.801894 | 3.235352 | ... | 4.14 | 4.670118 | 5.485289 | 5.680176 | 5.356295 | 5.921241 | 6.25527 | 5.846914 | 6.926839 | 7.054786 |
5 rows × 144 columns
# %%
# Save the results to an Excel file
output_file = "../data/participation_run_results.xlsx"
sim_results_df.to_excel(output_file)
print(f"Results saved to {output_file}")
Results saved to ../data/participation_run_results.xlsx